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Abstract 

Path delays in IP networks are important metrics, required by network operators for assessment, 
planning, and fault diagnosis. Monitoring delays of all source-destination pairs in a large network is 
however challenging and wasteful of resources. The present paper advocates a spatio-temporal Kalman 
filtering approach to construct network-wide delay maps using measurements on only a few paths. The 
proposed network cartography framework allows efficient tracking and prediction of delays by relying 
on both topological as well as historical data. Optimal paths for delay measurement are selected in an 
online fashion by leveraging the notion of submodularity. The resulting predictor is optimal in the class 
of linear predictors, and outperforms competing alternatives on real-world datasets. 

Index Terms 

Internet measurements, network kriging, kriged Kalman filter, delay prediction, submodularity 
optimization. 



Submitted April 19, 2012; revised March 3, 2013. 

The authors are with the Department of Electrical and Computer Engineering, University of Minnesota, 200 Union Street SE, 
Minneapolis, MN 55455, USA. Tel/fax: +1(612)624-9510/625-2002. E-mails: {ketan, emiliano, georgios}@umn . edu 
* Corresponding author. 

Work in this paper was supported by NSF-ECCS grant no. 1202135. Part of this paper has been presented at the IEEE 
Statistical Signal Processing Workshop, Ann Arbor, MI, Aug. 2012. 



March 3. 2013 



DRAFT 



IEEE TRANSACTIONS ON INFORMATION THEORY (SUBMITTED) 2 

I. Introduction 

The explosive growth in network traffic volumes has necessitated the development of avant- 
garde monitoring tools to endow network operators with a comprehensive view of the global 
network behavior. However, acquisition and processing of network-wide performance metrics 
for large networks is no easy task. For instance, monitoring path metrics such as delays or loss 
rates is challenging primarily because the number of paths generally grows as the square of the 
number of nodes in the network. Therefore, measuring and storing the delays of all possible 
source-destination pairs is hard in practice even for moderate-size networks. 

Focus has thus shifted towards statistical means of predicting network-wide performance 
metrics using measurements on only a subset of nodes [1], [2]. A promising approach in this 
context has been the application of kriging, a tool for spatial prediction popular in geostatistics 
and environmental sciences [3], [4]. A network kriging approach was developed in [5], where 
network-wide path delays were predicted using measurements on a chosen subset of paths. 
The class of linear predictors introduced leverages network topology information to model 
the covariance among path delays. This is accomplished in [5] by assigning higher correlation 
between two paths if they share several links, as in this case, they are expected to incur similar 
delay variations. 

The present paper puts forth a dynamic network kriging approach capable of real-time spatio- 
temporal delay predictions. Specifically, a kriged Kalman filter (KKF) is employed to explicitly 
capture variations due to queuing delays, while retaining the topology-based kriging predictor. 
The resulting dynamic network kriging approach not only yields lower prediction error, but is 
also more flexible, allowing delay measurements to be taken on random subsets of paths. In this 
context, the problem of choosing the optimal paths for delay measurements is also considered. 
Since the KKF runs in real-time, the paths are also selected in an online fashion by minimizing 
the prediction error per time slot. Interestingly, the resulting combinatorial optimization problem 
is shown to be submodular, and is therefore solved approximately via a greedy routine. 

Recently, a compressive sampling-based approach has also been reported for predicting network- 
wide performance metrics [6], [7]. For instance, diffusion wavelets were utilized in [6] to obtain 
a compressible representation of the delays, and account for spatial and temporal correlations. 
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Although this allows for enchanced prediction accuracy over [5], it requires batch processing of 
measurements which does not scale well to large networks for real-time operation. In contrast, 
both the KKF and the greedy path selection algorithms entail sequential operations, and are 
therefore significantly faster. 

Imputation of end-to-end delays has also been considered in the context of Internet geolocation. 
Treating end-to-end delays as distances between nodes, all-pair node distances are estimated using 
Euclidean embedding [8], or, matrix factorization [9]. However, these approaches do not exploit 
the temporal or topological information, since their focus is not on monitoring or extrapolation 
(that is, prediction) of delays. 

The rest of the paper is organized as follows. Sec. II introduces the model and the problem 
statement. Sec. Ill deals with the KKF approach, while Sec. ffi-A describes techniques for 
estimating the relevant parameters. Finally, empirical validation of KKF and comparisons with 
the Kriging approach of [5] are provided in Sec. V. 

Notation. Lower case symbols with indices, such as y p , represent scalar variables. These 
variables, when stacked over their indices are denoted through their bold-faced versions y. 
Bold-faced upper case symbols (S) represent matrices. Regular upper case symbols (S) represent 
constant scalars, and typically stand for the cardinality of the set represented by corresponding 
calligraphic upper case symbol (S). Identity matrix of size P x P is denoted by I P , , and its 
columns by ei, e 2 , . . ., e P . Matrix C y denotes the covariance matrix of the vector y. 

II. Modeling and Problem Statement 

Consider an IP network modeled by a connected digraph Q = (V, £), with V denoting the set 
of nodes (devices, servers, or routers), and £, the communication links. The issue is to monitor 
path delays on a set of multi-hop paths V that connect the P := \V\ source-destination pairs. 
Latency measured on path p E V at time t is denoted by y p (t), and all such network- wide 
delays are collected in the vector y(t). At any time t however, delay can only be measured 
on a subset of paths S(t) C V, which is represented by y s (t). Based on such partial current 
and past measurements H(t) := {y s (r)}^ =1 , the goal is to predict the remaining path delays 
ys(t) := {y P (t)} pe r\s(t) for each t. 
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The per-path end-to-end delay y p (t) consists of several independent components corresponding 
to contributions from each intermediate link and router. Of these, the queuing delay x P (t) is the 
time spent by the packets waiting in the queues of intermediate buffers, and depends on the 
traffic volumes in competing links. Network traffic is not only correlated spatio-temporally, 
but also exhibits non-stationarities, in the form of random fluctuations and bursts [10]. Indeed, 
it is not surprising that the statistical properties of queueing delays in large IP networks are 
largely unknown. In the interests of model parsimony and amenability to the tools used later, 
the following random-walk model is instead adopted for the latent vector of queuing delays, 

X(t) = x(t-l) + r 1 (t) (1) 

where rj(t) denotes state noise with zero mean, and covariance matrix C v := E [r](t)rj T (t)] . 
Observe that the random-walk model has very few tuning parameters, compared to say, a model 
which includes a non-identity state transition matrix (i.e., x(t) = B\(t — 1) + T)(t)). Further 
advantages of the random-walk model, including those pertaining to the computational cost, are 
provided in later sections. 

Other components of the path delay, combined in the nonzero-mean random v p (t), include the 
propagation, processing, and transmission delays, which are temporally uncorrelated (see e.g., 
[11] for details). This component of delays is however spatially correlated across paths, and the 
covariance matrix of the compacted vector u(t) is given by C„. Finally, the measurement of 
path delays using software tools such as ping itself introduces errors e p (t), which are assumed 
zero mean, uncorrelated over time and across paths, with covariance a 2 := E \e p {t)e^{t)\. 

The measured delays are expressed as 

y P (t) = X P (t) + v P (t) + e p (t) p G S(t). 

Letting S(t) denote the \S(t)\ x P selection matrix with 0-1 entries that contains the p-th. row 
of lp if p G S(t), the measurement equation can be compactly written as 

y s (t) = S(t)±(t) + u s (t) + e s (t) (2) 

where the vector e s (t) collects the measurement errors on paths p G S(t), and u s {t) := S(t)u(t). 

The next section describes a KKF approach for tracking and predicting the unknown end-to- 
end delays y s (t), by utilizing the state-space model described by (1) and (2). 
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III. Dynamic network kriging 

The spatio-temporal model in (l)-(2) is widely employed in geostatistics and environmental 
science, where x(t) is generally referred to as trend, and u{t) captures random fluctuations 
around x(t); see e.g., [3, Ch. 4], [12], [13]. Recently, a similar modeling approach was employed 
by [14] to describe the dynamics of wireless propagation channels, and in [15] for spatio-temporal 
random field estimation. In order to better relate the proposed model with the existing ones, 
the mean of u{t) is incorporated in the trend, and (2) is now replaced with 

y s (t) = S(t) X (t) + i/ s (t) + e a (t) (3) 

where u s (t) := v s (t) — E [v s {t)\ and x(t) '■= x(t) + ^ and likewise for v(t). Next, given 

only first- and second-order moments of r)(t), e s (t), and v(t), this section derives the best linear 
predictor for the unavailable path delay vector ys(t). 

Suppose first that the queuing delay vector x(t) is known, and let S(t) denote an \S(t)\ x 
P matrix containing the p-th row of Ip if p e S(i); that is, S(t) is a path selection matrix 
which returns quantities pertaining to paths in S(t). Then, the linear minimum mean-square 
error (LMMSE) estimator (denoted by E* [.]) for v s (t) is given by (see, e.g. [16]) 

E* [u s (t)\ X (t),y s (t)] = S(t)C„S T (t) (S(t)C„S T (t) + ( x%)- 1 

x [y s (t)-S(t) X (t)} (4) 

and is commonly referred to as kriging [4]. In practice however, the trend x(t) nas to be 
estimated from the data. In the so-termed universal kriging predictor [3], x(t) is estimated using 
the generalized least-squares (GLS) criterion, where v s (t) is treated as noise (lumped together 
with e s (t)). The prediction for v s (t) is then obtained by replacing x(t) in (4) with its estimate. 
This approach was proposed for network delay prediction in [5], and was referred to as network 
kriging. However, since the trend is estimated independently using GLS per time slot, its temporal 
dynamics present in (1) are not exploited. 

From the spatio-temporal model set forth in Sec. II, it is clear that estimating the trend x(t) can 
benefit from processing both present and past measurements jointly. Towards this end, the Kalman 
filtering (KF) machinery offers a viable option for tracking the evolution of x(t) from the set of 
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historical data H(t). At each time t, the KF finds the LMMSE estimate x(t) := E* [x(t)\n(t)], 
and its error covariance matrix M(£) := E [(x(t) — x(t))(x(t) — x(t)) T ] using the following 
set of recursions (see e.g., [16, Ch. 3]) 

X(t) = X(t - 1) + K(t)(y S (t) - S(t) X (t - 1)) (5a) 
M{t) = (I P - K(t)S(f))(M(f - 1) + C„) (5b) 

where the so-termed Kalman gain K(£) is given by 

K(t) := (M(t - 1) + C v )S T (t) 

x [S(*)(C„ + C V + M(t - l))S T (t) + <7 2 l5] 1 . (6) 

Once x(t) has been estimated via KF, Vg(t) can be readily obtained via kriging as in (4), yielding 
the predictor 

y- s (t) = S(t)x(t) + S(t)C„S T (t) (S(t)C„S T (t) + a%y l 

x [y s (t) - S(t)x(t)\ ■ (7) 

The predictor in (7) constitutes what is also referred to as the kriged Kalman filter [12], [13]. 
The LMMSE framework employed here yields the best linear predictor even for non-Gaussian 
distributed noise. The prediction error of the KKF is characterized in the following proposition, 
whose proof is provided in Appendix A. 

Proposition 1. The prediction error covariance matrix at time t is given by 

Mf (t) := E{(y s -(0 - y,-(t))(y a -(f) - y s -(t)f} (8a) 



(M(t - 1) + c„ + eg- 1 + ls T (t)s(t) 



S T (t) . (8b) 



Having a closed-form expression for the prediction error will come handy for selecting the 
matrix S(t), as shown later in Sec. IV. 

The KF step also allows r-step prediction for r > 1, which is given by y(t + r) = x(t), since 
the kriging term is temporally white. In the present context, this can be useful in preemptive 
routing and congestion control algorithms, as well as for extrapolating missing measurements. 
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In the latter case, the covariance matrix is updated simply as M(£) = M(£ — 1) + C v . Before 
concluding the description of the KKF, the following remarks are due. 

Remark 1. The random walk model adopted in (1) may result in an unstable filter. Operationally, 
if the KKF is unstable, an incorrect initialization of M(0) or %(0) may result in poor prediction 
performance even as t — > oo. This can be remedied by adopting a damped model x(t) — 
bx(t — 1) + with b < 1. Here, %(£) is a zero-mean random process which does not 
incorporate the mean of v(t). The mean delay of all paths should instead be estimated a priori, 
and subtracted from the measurements themselves, so that each component of the path delay 
in (3) is zero-mean. With this modification, the results in this paper can be generalized to the 
damped case. The random walk model is nevertheless used here since no instability issues were 
observed in the two data sets considered in Sec. V. An alternative formulation, developed along 
the lines of [17], can also be used in the AR(1) case. This technique may however increase 
the number of state- space parameters, and considerably complicate the expressions developed in 
Sec. IV. 

Remark 2. A distributed implementation of the KKF may be desirable for enhancing the 
robustness and scalability of delay monitoring. In large-scale networks, a distributed algorithm 
also mitigates the message passing overhead required to collect all measurements at a fusion 
center. If the model covariances C„ and C v are globally known, and the selection matrix S(t) 
is constant for all t, a distributed implementation of (5) can be derived along the lines of [18]. 
To this end, notice that substituting (5a) in (7), one can re-write the KKF predictor as 

Ut) = m \F(t) - F(f)S(f)K(f) + K(f)] y s (t) + S(t)x(t - 1) 

+ S(f) [F(f)S(f)K(f) _ K (t) - F(f)] S(t)x(t - 1) (9) 

where F(t) := C u S T (t) (S(t)C„S T (t) + a 2 I s y\ With X (t ~ 1) available from the previous 
iteration, it is clear from (9) that if d(t) := [F(t) - F(t)S(t)K(t) + K(t)] y s (t) were available 
at each node of the network, the KKF predictor (7) could be performed locally at each node. 
Assume that measurements are collected at a sub-set of nodes V s C V, and node v E V s 
measures delays of the set of paths S v C S; that is, v is the end-node of all the paths in S v . 
Then, to compute d(t) in a distributed manner, consider rewriting it as a sum of |V S | terms, each 
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involving only the local measurements y SjV (t) := [{y p \p G S V }] T . Next, collect in the P x \S V \ 
matrix & v (t), the columns of matrix F(t) — F(£)S(t)K(£) + K(£) corresponding to the paths in 
S v . Then, d(t) can be expressed as d(t) = ^2 veV &v(t)ys,v(t), which is equivalent to [19] 



where d v (t) represents a local copy of d(t) at node v, and V s C V s is the set of nodes 
communicating with v. Building on (10), an iterative consensus algorithm whereby each node v 
exchanges its local copy d v (t) only with nodes in V s , can be derived by employing the so called 
alternating direction method of multipliers as detailed in [19] and [18]. Notice that, since the 
model covariances are globally known, recursions (8a) can be performed locally at each node. 

A. Estimating model parameters 

The LMMSE-optimal dynamic kriging framework described in Sec. Ill requires knowledge of 
model covariance matrices C u , u 2 ls, and C v , to operate. Of these, a 2 depends on the precision 
offered by the measurement software, and can be safely assumed known a priori. 

The structure of C u is motivated by the modeling assumptions and utilizes topological infor- 
mation. Intuitively, propagation, transmission, and processing delays over paths p,q G V should 
be highly correlated if these paths share many links. This relationship can be modeled by utilizing 
the Gramian matrix G := RR T , where R is the P x \£\ path-link routing matrix; that is, the 
(p, Z)th element of R is 1 if path p G V traverses link I G £, and otherwise. Each off-diagonal 
entry (p, q) of G represents the number of links common to the paths p, q G V. On the other 
hand, the elements on the main diagonal of G count the number of constituent links per path. 
The covariance matrix of u(t) can therefore be modeled as C„ = 7G. 

A similar model for C u was adopted by [5], where it was motivated from the property that 
path delays are sum of link delays, that is, v(t) = Rx(£), where vector x(i) collects the link 
delays. Under this assumption, it holds that C u = 7G if the link delays are uncorrected across 
links, and have covariance matrix 7l|£|. Note that the KKF and path- selection techniques also 
work with a generic link-delay covariance matrix S, i.e., C„ = RSR T . Unfortunately however, 
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in most IP networks the link delays cannot be directly observed, which makes estimation of £ 
difficult, if not impossible. For example, consider a network (1-2-3) where two end terminals 
(nodes 1 and 3) are connected via an intermediate router (node 2). Clearly, the delays incurred 
by the individual links (1-2 and 2-3) cannot be discerned from each other, no matter how 
accurately the end-to-end delays (between 1 and 3) are measured. The same reasoning applies 
to the corresponding covariance matrices, irrespective of the estimation technique used. 

For the remaining parameters, namely 7 and C v , an empirical approach is described next. 
It entails a training phase, and a set of measurements {y s {t)}\=i collected at time slots t = 
1, . . . , t L . During the KKF operation, t L — 1 time slots can be periodically devoted to updating 
model covariances, while predicting the networks-wide delays y~s(t) for t = l,...,t L . Let 
Cjy(t) := 7(£)G and C v (t) denote the estimates of C„ and C v , respectively, at time t. Estimating 
the covariance matrix of the state noise is well-known to be a challenging task, primarily because 
X(t) and x(t — 1) are not directly observable. Furthermore, methods such as those in [20] are 
not applicable in the present context, as they require the KF to be time-invariant and stationary. 
As shown in [21], a viable means of estimating C v from {y s (t)}^ 1 relies on approximating the 
noise rj(t) as q(t) := — x(t — 1). Then, upon noticing that the resultant process {q(r)} is 
temporally-white, the sample mean and covariance of q can be obtained as 



Using (12), and exploiting the equality E{CJ = (t L - l) _1 Et( M (* ~ l ) ~ M W) + C v> il 
follows that an unbiased estimate of C v can be obtained as 



Finally, in order to obtain 7, consider the innovations at time t as i p (t) := y p (t) — x p (t — 1), 
and notice that if the model covariances are correct, then i p (t) is temporally white and zero-mean 
[20]. Indeed, it is possible to show that E [t p (t)t q (t)\ = [M(t - 1) + C v + C„] pq + a 2 for any 
p, q G S(t) [21]. Further, let T pq := {t\l < t < t L ,p,q £ <S(t)} be the set of time slots for 




(ID 




(12) 




(13) 
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which paths p and q are both measured. Then, the sample covariance between t p (t) and t q (t) 
is given by C pq := |7^| _1 J2teT pq L p(t) L q(t) f° r P a i fS Q e *P. Given M(t — 1) and a 2 , this 
observation yields the following estimate 

C„(t)l = p-Ur V ^K(0 - a 2 - [M(t - 1) + C 1| (*)] M . (14) 

L hq ^teT Pq 

Indeed, entries of C„(t) can be updated recursively using C u (t — 1) in (14). At each time, only 
a few entries are updated, depending on which paths are observed (cf. S{t)). 

Finally, 7(t) can be obtained by fitting C„(t) to 7G in the least-squares sense, which yields 

Wl) = ^ttt 2 • (15) 

As further justification for the random-walk model, it is remarked that a model of the form 
x(t) = Bx(t — 1) +'n{t) requires learning the entries of B. Since the state vector is not directly 
observed, estimation of B is usually significantly more difficult [16], [22], [23]. Such a model 
would also need a longer training phase, and may exhibit poor generalization performance if the 
amount of training data is limited [24]. This problem also arises when trying to use the model 
C u = RSR T , where additionally, S is not uniquely identifiable, as explained earlier. 

IV. Online Experimental Design 

This section considers the problem of optimally choosing the set of paths S(t) (equivalently, 
the matrix S(t)) so as to minimize the prediction error. To begin with, a simple case is considered 
where the set S(t) is allowed to contain any S paths. Operational requirements may however 
impose further constraints on S(t), and these are discussed later. 

The prediction error can be characterized by using a scalar function of Mf (t); see e.g., [25]. 
To this end, the so called D-optimal design is considered, where the goal is to minimize the 
function f(S(t)) := logdet(Mf (t)). The paths selected at time t are therefore given by the 
solution of the following optimization problem 

5*(t) = argnun/(5) (16) 

s. t. 151 = S. (17) 
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Clearly, tackling (16) incurs combinatorial complexity and is challenging to solve exactly, even 
for moderate-size networks. Indeed, (16) is an example of the so called subset selection problem, 
which is NP-complete in general; see e.g., [26] and references therein. 

Interestingly, it is possible to solve (16) approximately by utilizing the notion of submodularity. 
Consider a function g(S), which takes as input sets S C V. Given a set A E V and an element 
p £ V \ A, the increment function is defined as 8 9 A {p) := g(A U {p}) — g(A). Function g(-) 
is submodular if its increments are monotonically decreasing, meaning 5 A (p) > 5 9 B {p) for all 
A C B E V. Likewise, g(-) is supermodular iff 5 A (p) < 5 9 B (p) for all A C B E V. In the present 
case, the following proposition holds. 

Proposition 2. The function f(S) is monotonic and supermodular in S. 

The proof of Proposition 2 is provided in Appendix B, and relies on related results from [25]. 

An important implication of Proposition 2 is that a greedy forward selection algorithm can 
be developed to solve (16) approximately [27]. Upon defining the shifted function h(S) := 
f(S) — logdet(M(t — 1) + C v + C u + o" 2 Ip), a result from [27] ensures that the solution of the 
greedy algorithm S 9 (t) satisfies the inequality 



h(S 9 (t))< [l--)h(S*(t)). (18) 

While performance of the greedy algorithm is usually much better in practice, this bound ensures 
that it does not break down for pathological inputs. 

The greedy algorithm involves repeatedly performing the updates S ^— S U argmin pi ^5 5g(p) 
until | S | = S. This is useful in the present case, since the increments can be evaluated efficiently 
using determinant update rules. Specifically, the updates are given by 

^(p) = -log(l+[M(i-l) + C„ + cJ J VpeV (19) 

V L J p,pj 

^(p) = -logfl+ [((M(f -1) + C„ + Cg- 1 + S T Sr 1 l ) V P eV\S. (20) 

Further, each iteration requires a rank-one update to the matrix inverse in (20), which can 
also be performed efficiently. The full greedy approach is summarized in Algorithm 1, where 
$ := (M(t — 1) + C v + C„)/a 2 . Algorithm 1 involves only basic operations, and it is easy 
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to verify that its worst case complexity is 0(PS 3 ). Further, the final value of the matrix V 
evaluated in the last iteration (Algorithm 1, line 11) is exactly the inverse term required for 
evaluating the Kalman gain in (6). In fact, the operational complexity can be further reduced 
using lazy updates [28]. Finally, it is worth mentioning that the low-complexity of Algorithm 
1 is also a result of the random-walk model used here. In particular, the state space model 
x(t) = Bx(t — 1) + r](t) would result in significantly more complicated expressions. 



Algorithm 1 Greedy algorithm for solving (16) 



i: function Greed y($, S) 



2: 

3: 
4: 
5: 
6: 
7: 
8: 
9: 
10: 

11: 



s <— arg max [<&L D 

i<p<p 1 lp ' p 



V : = 

s 



V ([*]., + !)] 
W 



for k = 2 : S do 

w p <- $ S , P for all p e V \ S 

s <- arg max [*] PjP - w^Vw p 

P<£S 

d <- [*] a>a - wfVw s + 1 

u < yw s 

\ + uu T /d u/d 

V <- 

u T /d l/d 



II w p has entries [$] SiP for all s E S 



12: 



return S 



Next, consider a more practical scenario, where the software installed at each end-node can 
measure delays on all paths originating at that node. At any time t however, delays are measured 
from only N end-nodes. Let V e denote the set of all end-nodes, and V v , the set of paths which 
have the node v E V e as their origin (likewise, TV := UveN^v ^ or ^ c ^ or an y su bset M 
(and its complement AT :— V \ M), define the selection matrix N (N) consisting of canonical 
vectors ep" as rows, for all p E (p £ Vtf). Defining the cost function f n (M) := /(7-V)> the 
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online optimal design problem for this scenario is expressed as 

A/-*(t) = argm c in/ n (AO (21a) 

s. t. \AT\ = N. (21b) 

It follows from the properties of submodular functions that the cost function f n (N) is also 
monotonic and supermodular in J\f. In particular, observe that the increments 5^{v) = f n (J\fU 
{v}) — f n (N) = f(V_\f U V v ) — /(TV) for v ^ J\f satisfy the non-increasing property, i.e., 
$a( v ) — $b( v ) f° r all ^4 C i3 C V e and v ^ B. A greedy algorithm similar to Algorithm 1 can 
therefore be developed to obtain an approximate solution with the same (1 — 1/e) guarantee 
as in (18). Complexity of the greedy algorithm in this case would be however higher, since 
evaluating 5^{v) now requires rank- \V V \ updates in the determinant and inverses. Nevertheless, 
the algorithm would still be efficient as long as \V V \ « P for all v e V e . In the special case 
when delay measurements are performed by only one node per time slot (N — 1), the solution 
of (21a) is simply given by 

Af*(t) = arg min log det (l\ Vv \ + |m(* - 1) + C„ + cJ ) (22) 



where 



M(t - 1) + C v + C L 



is the \V V \ x \V V \ submatrix containing the rows and columns 



of ~M.it — 1) + C v + C„ corresponding to the paths in V v . 

In some networks, it may be relatively straightforward to install delay measurement software 
on every end-node, while allowing each end-node to measure delay on only one path per time 
slot. This amounts to replacing the budget-constraint (17) in (16) with 

\SnV v \ = i VveV e . (23) 

Interestingly, constraints of this form can also be handled using the greedy approach by simply 
imposing (23) while searching for the best increment at every iteration. Specifically, the search 
space of path p [cf. Algorithm 1, line 7] now becomes p e V\Vj^, where J\f — {v : SC\V V ^ 0}. 
More general constraints of the form |«S fl V v \ < S v can similarly be incorporated. Constraints 
of this form are referred to as partition matroid constraints, under which the greedy algorithm 
provides an approximation ratio of 1/2 [29]. 
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V. Empirical Validation 

Performance of the proposed network-wide latency prediction schemes is validated using two 
different datasets, which include delays measured on: 

(a) Internet2 backbone network 1 , a lightly loaded network that exhibits low delay variabil- 
ity; and, 

(b) New Zealand Active Measurement Project (NZ-AMP) 2 , a network deployed across 
several universities and ISPs in New Zealand, characterized by comparatively higher 
variability in delays. 

Using the aforementioned datasets, the performance of KKF is also compared against that of 
competing alternatives in [5] and [6]. 

Before proceeding, a brief description of the nonlinear estimation technique in [6] is provided. 
The approach hinges on a sparse representation of the network-wide delays, and employs £i-norm 
minimization to recover the sparse basis coefficient vector. Specifically, the path delays adhere 
to the postulated linear model y(t) = H/3(t), where ||/3(£)|| < P, and the matrix H e R PxP is 
constructed using diffusion wavelets [30]. The diffusion matrix used for computing the wavelet 
basis is obtained by applying Sinkhorn balancing [31] to the matrix W e M PxP , whose (p, q)-th 
element is defined as 

' ^ P,CI [G] pp + [G]gg - [G] pq 

where G is the Gramian defined in Sec. III-A. The overall algorithm amounts to solving the 
following minimization problem 

P'(t) =argmm||/3'|| 1 (25a) 

s. t. y s (t) = S(f)HL/3' (25b) 

where L is a diagonal matrix whose (n, n)-th entry is given by [L]„ jTl = 2 k , with k e N 
denoting the scale corresponding to the diffusion wavelet coefficient f3 n [6]. Subsequently, y§(t) 
is predicted as y s (t) = S(t)HL$'(t). 

'[Online] http : / / www . inter net 2 . edu /net work 
2 [Online] http://erg.cs.waikato.ac. nz/amp 
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Under the premise that delays change slowly with time, the described algorithm can be used to 
estimate y§(t) over a sequence of t > 1 contiguous time-steps jointly. In this case, problem (25) 
is solved by replacing y a (t) with y a (t) := [yj(t - r + 1), yj (t - r + 2), . . . , y^(t)] T , and by 
computing the Pt x Pt diffusion wavelet matrix based on W and temporal correlations as 
shown in [6]. Although this is a viable way to capture temporal correlations of delays, observe 
that it requires solving £i-norm minimization problems with Pt variables every r time slots. 
This increase in complexity prohibits the use of a large value of r, and the simulations here 
only report performance with r = 5. It is also worth mentioning that such a batch solution also 
does not compare favorably to a real-time implementation, such as that provided by the KKF 
where delay predictions become available every time new measurements arrive. 

A. Internet! Delay Data 

The One Way Active Measurement Project (OWAMP) collects one way delays on the Internet2 
backbone network 3 . The network has 9 end-nodes and 26 directional links as depicted in Fig. 1. 
Delays are measured on the 72 paths among the end-nodes every minute. The data {y(t)} is 
collected over t P = 4500 minutes (about three days) in July 2011. 

The model KKF covariances C„ and C v are estimated using data from the initial 1,000 time 
slots. In this phase, 50 paths are randomly selected per time slot. The KKF is initialized by 
setting 7 = 1, C v = C„, and run for 500 time slots. Next, j(t) and C v (t) are, updated in an 
online fashion, as outlined in Sec. III-A. The final values are obtained at the conclusion of the 
training phase at t = 1,000. 

Pictorially, the performance of different algorithms can be assessed through delay maps 
shown in 2. Such maps can succinctly represent the network health, and are especially useful 
for networks which otherwise have low delay variability, such as the Internet2. The map in 
Fig. 2(a) corresponds to the true delays, wheres maps (b), (c), and (d) depict the predicted values 
obtained from the network kriging, wavelet-based approach, and KKF respectively. Predictions 
are performed using measurements over an interval of 100 minutes on 10 random paths (same 

3 [Online] http : / /ndbl . net . internet2 . edu/ cgi-bin/ owamp . cgi 
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Seattle 



Los Angeles 




Houston 



Fig. 1. Internet2 IP backbone network. 

paths are used throughout the considered interval), and the delays are predicted on the remaining 
62 paths are reported. In these maps, paths are arranged in increasing order according to the true 
delay at time t = 1. It can be seen that the map produced by the kriging and compressive sensing 
approaches are very different from the true map. In contrast, the map obtained when using the 
KKF is close to the true map. In particular, observe that the delays of several paths change 
slightly around t = 80 in Fig. 2(a). However, of the three maps, this change is only discernible 
in the KKF map in 2(d). The delay predictions provided by the KKF are thus sufficiently accurate 
for human inspection at control centers, even when monitoring a few paths. 

It should be remarked that the maps in Fig. 2 are only for demonstration purposes, and not 
much can be inferred about the relative performance of different algorithms from these depictions 
alone. For a more detailed analysis of the different delay prediction approaches, consider the 
normalized mean- square prediction error (NMSPE), defined as 



The prediction performance of the three algorithms is first assessed by using delay measurements 




(26) 



March 3. 2013 



DRAFT 



IEEE TRANSACTIONS ON INFORMATION THEORY (SUBMITTED) 



17 




(c) Wavelets (d) KKF 

Fig. 2. True and predicted delay map for 62 paths in the Internet2 network over in interval of 100 minutes. 



on randomly selected paths for each t. The (same) randomly selected paths are used for all three 
approaches. Fig. 3 depicts the NMSPE as a function of S, the number of paths on which delays 
are measured. Clearly, the KKF markedly outperforms the other two approaches across the entire 
range of S. As expected [6], the compressive sampling-based approach provides a more accurate 
prediction than network kriging. 

Next, the performance of the three algorithms is analyzed for the case when paths for delay 
measurement are selected optimally. For the network kriging and the wavelet-based approaches, 
the optimal paths are obtained according to the selection procedures provided in [5] and [6], 
respectively. As pointed out in [6], performance of the wavelet-based approach can be improved 
by capitalizing on temporal correlations. This is done by solving (25) using measurements from 
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Fig. 3. NMSPE as a function of S, Internet2 network with random path selection. 



t = 5 consecutive time slots in a batch form. The temporal correlation is set to 0.5 and the 
optimal paths are obtained again using the selection strategy outlined in [6]. For the KKF, optimal 
paths are selected in an online fashion using Algorithm 1. Again, a significantly more accurate 
prediction of the path delays for the entire range of S is obtained via the KKF. 

B. NZ-AMP Delay Data 

The KKF algorithm is tested here using delay data from NZ-AMP. The project continuously 
runs I CMP and scamper to determine the topology and delays between a set of nodes in New 
Zealand. The data collected for this paper consist of end-to-end delays measured every ten 
minutes over the month of August 2011. The network has a total of 186 paths, whose delays 
range from almost constant to highly variable, at times reaching up to 250ms. 

In Fig. 5, the NMSPE as a function of S is reported, for the case where paths that are to be 
measured are chosen randomly. Again, same paths are used for the three considered schemes. 
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Fig. 4. NMSPE as a function of S, Internet2 network with optimal path selection. 



The KKF provides a markedly lower prediction error also for the NZ-AMP delay data. On the 
other hand, Fig. 6 shows the NMSPE on optimally selected paths for all three schemes. The 
KKF performs relatively better than the competing schemes for this data set as well. Observe 
though that the actual values of the NMSPE incurred for this dataset is at least an order of 
magnitude higher than those in the Internet2 dataset. Indeed, given the high variability in the 
data, it is possible to improve upon the prediction accuracy of KKF by training it better. This is 
showcased by the considerably lower prediction error curve for training interval ££=2,000 shown 
in Fig. 6. 

While the NMSPE is useful for characterizing the average performance, network operators 
are also interested in the prediction accuracy over the entire range of delay values. Towards 
this end, Fig. 7 shows the scatter plots of ys(t) versus ys(t) for all t and S = 30 optimally 
selected paths. The points cluster around the 45-degree line y§(t) = ys(t), and the thinner the 
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Fig. 5. NMSPE as a function of S, NZ-AMP network with random path selection. 



"cloud" of points is, the more accurate the estimates are. Indeed, it can be seen that the points 
generated from the KKF estimates are crammed in a very close area around the 45-degree line, 
and accurate estimates are produced for the entire range of experienced delays. Furthermore, the 
scatter plots corroborate the unbiasedness of the KKF predictor. 

VI. Conclusion 

The present paper develops a spatio-temporal prediction approach to track and predict network- 
wide path delays using measurements on only a few paths. The proposed algorithm adapts a 
kriged Kalman filter that exploits both topological as well as historical data. The framework also 
allows for the use of submodular optimization in the selection of optimal delay measurement 
locations. The problem of path selection is formulated for different types of constraints on the set 
of selected paths, and solved in an online fashion to near-op timality. The resulting predictor is 
validated on two datasets with different delay profiles, and is shown to substantially outperform 
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Fig. 6. NMSPE as a function of S, NZ-AMP network with optimal path selection. 



competing alternatives. 

Appendix A 
Error covariance matrix 

Towards deriving an expression for Mf (t), observe that the prediction error can be written as 

y s (t) - Ut) = mx(t) + S(t)»(t) + e- s (t) - S(t)x(t) 

- S(i)C„S T (t) (S(t)C„S T {t) + o^Is)- 1 [y s (t) - S{t)x{t)\ (27) 

- S(t)C„S T (t) (S(t)C„S T (t) + aHsy 1 [S{t)( X {t) - m + u(t)) + e s (t)) . 

(28) 
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Fig. 7. Scatter plot for the NZ-AMP network, S = 30 with optimal path selection. 



Using (5a), the term x(t) — x(t) can De written as 

X(t) - m = X(t) - X(t - 1) - K(t) [S(t)( X (t) + u(t)) + e s (t) - S(t) X (t - 1)] 
= X(t) - X(t - 1) + K(t)S(t)( X (t) - X (t - 1) + u{t)) + K(t)e s (t) 
= (I P - K(t)S(t))x(t) - K(t)S(t)u(t) - K(t)e s (t) (2< 
where x(t) := x(t) ~ x(t ~ !)• Substituting (29) in (28), it follows that 
y,-(t) - h(t) = S(t)(Ip - K(t)S(t))( X (t) + u(t)) - S(t)K(t)e s (t) + e s {t) 
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- S(t)C„S T (t) (S(t)C u S T (t) + a 2 !?)- 1 

x [S(t)(I P - K(f)S(t))(*(*) + v{t)) - S(t)K(t)e s (t) + e s (t)} (30) 
= S(t)(Ip-K(0S(t))(x(0 + 

- S(*)C„S T (t) (S(t)C„S T (t) + S(*)(I P - K(t)S(t))(x(t) + 

- S(t)K(t)e s (t) - S(t)C u S T (t) (S(t)C„S T (t) + aHs)' 1 (I s - S(t)K(t))e s (t) 
+ e s (t) (31) 

which, after some manipulations, can be expressed as 

y,(t) - y s (t) = s(0(ip - Q(*)S(*))(x(t) + 1/(0) + Q(*)«*(0 + (32) 

where 

Q(f) := K(t) + C„S(t)(S(t)C u S T (t) + aHg)- 1 - C„S(*)(S(t)C„S T (t) + ( r 2 I 5 )- 1 S(t)K(t). 

(33) 

Next, substituting for K(i) from (6), the expression for Q(t) simplifies to 

Q(t) = (M(t - 1) + C T7 )S r (0 [S(t)(M(* - 1) + C„ + C,)S T (t) + a 2 I 5 ] _1 

+ as T (o(s(oas T (o+a 2 i 5 )- 1 

- C,S T (t)(S(t)C i ,S T (t) + a 2 I 5 )~ 1 S(t)(M(t - 1) + C v )S T (t) 

x [S(0 (M(t - 1) + C,, + C„)S T (t) + a%] 1 (34) 
= (M(t - 1) + C,, + C„)S T (*) [S(t)(M(t - 1) + C„ + C,)S T (t) + <7 2 I 5 ] _1 . (35) 

Utilizing the fact that x(t), ^(0' e s(0> an ^ e s(0 are mutually uncorrelated, with E [x(t)x T (t)j := 
M(t — 1) + C v , the error covariance matrix Mf (t) becomes 

M?(t) = E [(y 5 (i) - y s -(t))(y s (t) - y a -(t)) T ] (36) 
= S(0(Ip - Q(t)S(t))(M(t - 1) + C, + C„)(I P - S T (t)Q T (t))S T (t) 

+ a 2 S(t)Q(t)Q T (t)S T (t) + a 2 Ip_ 5 (37) 
= S(t)(M(t - 1) + C„ + C v )S T (t) - 2S(t)Q(t)S(t)(M(t - 1) + C„ + C v )S T (t) 
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+ S(*)Q(*)S(*)(M(t - 1) + C„ + C u )S T (t)Q T (t)S T (t) + a 2 S(t)Q(t)Q T (t)S T (t) 
+ ^ 2 Ip-5 (38) 

= S(t)(M(t - 1) + C„ + C v )S T (t) - S(t)Q(t)S(t)(M(t - 1) + C„ + Cr,)S T (t) 

+ v 2 Ip-s. (39) 

Substituting for Q(t) [cf. (35)] in (39), and using the Woodbury matrix identity [32], the final 
expression for Mf (t) becomes 

MUt) = a 2 I P „ s + S(t) 



(M(t - 1) + c„ + eg 1 + ls T (t)s(t) 



n -1 

T 



S J (t) . (40) 



Appendix B 

Proof of monotonicity and supermodularity of / 
Let $ := 4j-(M(i — 1) + Cr, + C v ), and observe that / can be written as 

f(S) = \og(a 2 ) + log det [Ip_ s + S($- 1 + S T S)- 1 S T ] (41a) 
= log(a 2 ) + log det [I P + S T S + S T S) - 1 ] (41b) 
= log(a 2 ) + log det + S T S + S T S] + log det + S T S)- X ] (41c) 

where (41b) follows from Sylvester's theorem for determinants [32]. 
Observing that S T S + S T S = 1 P , it is possible to write /(«S) as 

f{S) = log{a 2 ) + log det($" 1 + I P ) - log det + S T S) . (42) 

Next, consider the decomposition $ = UU T , and define the shifted function 

h(S) := f(S) - \og(a 2 ) - log det ($ + I P ) (43a) 

= - log det(Ip + S T S$) (43b) 

= - log det [I s + (SU) (SUf] (43c) 

where Sylvester's theorem has again been used in (43c). Finally, it is well known that a function 
of the form log det (Ip + (SU) T (SU)) is non-decreasing and submodular (see e.g., [25]), which 
allows one to deduce that f(S) is non-increasing and supermodular. Note further that the greedy 
approach from [27] can be used on h(S) by defining h($) = 0. 

March 3, 2013 DRAFT 



IEEE TRANSACTIONS ON INFORMATION THEORY (SUBMITTED) 



25 



References 

[1] H. Singhal and G. Michailidis, "Structural models for dual modality data with application to network tomography," IEEE 

Trans. Inf. Theory, vol. 57, no. 8, pp. 5054-5071, Aug. 2011. 
[2] M. G. Rabbat, M. A. T. Figueiredo, and R. D. Nowak, "Network inference from co-occurrences," IEEE Trans. Inf. Theory, 

vol. 54, no. 9, pp. 4053^1068, Sep. 2008. 
[3] B. D. Ripley, Spatial Statistics. John Wiley & Sons, 1981. 

[4] N. Cressie, "The origins of kriging," Mathematical Geology, vol. 22, no. 3, pp. 239-252, 1990. 

[5] D. B. Chua, E. D. Kolaczyk, and M. Crovella, "Network kriging," IEEE J. Sel. Areas Commun., vol. 24, no. 12, pp. 
2263-2272, Dec. 2006. 

[6] M. Coates, Y. Pointurier, and M. Rabbat, "Compressed network monitoring for IP and all-optical networks," in Proc. ACM 

Internet Measurement Conf, San Diego, CA, Oct. 2007. 
[7] W. Xu, E. Mallada, and A. Tang, "Compressive sensing over graphs," in Proc. of IEEE 1NFOCOM, Shanghai, China, Apr. 

2011, pp. 2087-2095. 

[8] F. Dabek, R. Cox, F. Kaashoek, and R. Morris, "Vivaldi: A decentralized network coordinate system," in Proc. of the ACM 

SIGCOMM, Portland, Oregon, USA, 2004, pp. 15-26. 
[9] Y. Liao, P. Geurts, and G. Leduc, "Network distance prediction based on decentralized matrix factorization," in Proc. of 

IFIP Networking, Chennai, India, May 2010. 
[10] A. Lakhina, K. Papagiannaki, M. Crovella, C. Diot, E. D. Kolaczyk, and N. Taft, "Structural analysis of network traffic 

flows," in Proc. of ACM SIGMETRICS, New York, NY, 2004, pp. 61-72. 
[11] C. J. Bovy, H. T. Mertodimedjo, G. Hooghiemstra, H. Uijterwaal, and P. van Mieghem, "Analysis of end-to-end delay 

measurements in Internet," in Proc. Passive and Active Measurement Workshop, Fort Collins, CO, Apr. 2002. 
[12] K. V. Mardia, C. Goodall, E. J. Redfern, and F. J. Alonso, "The Kriged Kalman filter," Test, vol. 7, no. 2, pp. 217-285, 

Dec. 1998. 

[13] C. K. Wikle and N. Cressie, "A dimension-reduced approach to space-time Kalman filtering," Biometrika, vol. 86, no. 4, 
pp. 815-829, 1999. 

[14] S.-J. Kim, E. Dall'Anese, and G. B. Giannakis, "Cooperative spectrum sensing for cognitive radios using Kriged Kalman 

filtering," IEEE J. Sel. Topics Signal Process., vol. 5, no. 1, pp. 24-36, Feb. 2011. 
[15] J. Cortes, "Distributed Kriged Kalman filter for spatial estimation," IEEE Trans. Auto. Contr., vol. 54, no. 12, pp. 2816- 

2827, Dec. 2009. 

[16] B. D. O. Anderson and J. B. Moore, Optimal Filtering. Englewood Cliffs, NJ: Prentice-Hall, 1979. 

[17] P. Casas, S. Vaton, L. Fillatre, and T. Chonavel, "Efficient methods for traffic matrix modeling and on-line estimation in 

large-scale ip networks," in Proc. of the 21st Intl. Teletraffic Congress, Paris, France, 2009, pp. 45-54. 
[18] E. Dall'Anese, S.-J. Kim, and G. Giannakis, "Channel gain map tracking via distributed kriging," IEEE Trans. Veh. Technol., 

vol. 60, no. 3, pp. 1205-1211, Mar. 2011. 
[19] I. D. Schizas, A. Ribeiro, and G. B. Giannakis, "Consensus in ad hoc WSNs with noisy links - part I: Distributed estimation 

of deterministic signals," IEEE Trans. Signal Process., vol. 56, no. 1, pp. 350-364, Jan. 2008. 



March 3, 2013 



DRAFT 



IEEE TRANSACTIONS ON INFORMATION THEORY (SUBMITTED) 



26 



[20] R. Mehra, "On the identification of variances and adaptive Kalman filtering," IEEE Trans. Autom. Control, vol. 15, no. 2, 
pp. 175-184, Apr. 1970. 

[21] K. Myers and B. Tapley, "Adaptive sequential estimation with unknown noise statistics," IEEE Trans. Autom. Control, 

vol. 21, no. 4, pp. 520-523, Aug. 1976. 
[22] M. K. Tsatsanis and G. B. Giannakis, "Modeling and equalization of rapidly fading channels," Intl. J. Adaptive Control 

and Signal Process., vol. 10, no. 2-3, pp. 159-176, 1996. 
[23] M. K. Tsatsanis, G. B. Giannakis, and G. Zhou, "Modeling and equalization of rapidly fading channels," Signal Process., 

vol. 53, no. 2-3, pp. 211-229, 1996. 
[24] C. M. Bishop, Pattern Recognition and Machine Learning. Springer, New York, 2006. 

[25] F. Bach, "Learning with submodular functions: A convex optimization perspective," Foundations and Trends in Machine 

Learning, 2012. [Online]. Available: http://arxiv.org/abs/llll.6453 
[26] A. Das and D. Kempe, "Algorithms for subset selection in linear regression," in Proc. of the ACM Symp. on Theory of 

Computing, Victoria, British Columbia, Canada, May 2008, pp. 45-54. 
[27] G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher, "An analysis of approximations for maximizing submodular set functions 

- I," Mathematical Programming, no. 1, pp. 265-294, Dec. 1978. 

[28] M. Minoux, "Accelerated greedy algorithms for maximizing submodular set functions," in Optimization Techniques, ser. 

Lecture Notes in Control and Information Sciences, J. Stoer, Ed. Springer Berlin / Heidelberg, 1978, vol. 7, pp. 234-243. 
[29] M. L. Fisher, G. L. Nemhauser, and L. A. Wolsey, "An analysis of approximations for maximizing submodular set functions 

- II," Mathematical Programming Study, pp. 73-87, 1978. 

[30] R. R. Coifman and M. Maggioni, "Diffusion wavelets," Applied and Computational Harmonic Analysis, vol. 21, no. 1, 
pp. 53-94, 2006. 

[31] R. Sinkhorn, "A relationship between arbitrary positive matrices and doubly stochastic matrices," The Annals of 

Mathematical Statistics, vol. 35, no. 2, pp. 876-879, 1964. 
[32] G. H. Golub and C. F. V. Loan, Matrix Computations, 3rd ed. Johns Hopkins University Press, 1996. 



March 3, 2013 



DRAFT 



